46 research outputs found

    Exploring the topical structure of short text through probability models : from tasks to fundamentals

    Recent technological advances have radically changed the way we communicate. Communication has become ubiquitous, fostering the need for information that is easier to create, spread and consume. As a consequence, text messages have become shorter in media ranging from e-mail and instant messaging to microblogging. Moreover, the ubiquity and fast-paced nature of these media have promoted their use for previously unthinkable tasks. For instance, real-world events were classically reported by news reporters, but nowadays the most interesting events are first disclosed on social networks like Twitter by eyewitnesses through short text messages. As a result, the exploitation of the thematic content in short text has captured the interest of both research and industry. Topic models are a type of probability model traditionally used to explore this thematic content, a.k.a. topics, in regular text. The most popular topic models fall into the sub-class of LVMs (Latent Variable Models), which include several latent variables at the corpus, document and word levels to summarise the topics at each level. However, classical LVM-based topic models struggle to learn semantically meaningful topics in short text because the lack of co-occurring words within a document hampers the estimation of the local latent variables at the document level. To overcome this limitation, pooling and hierarchical Bayesian strategies that leverage contextual information have been essential to improve the quality of topics in short text. In this thesis, we study the problem of learning semantically meaningful and predictive representations of text in two distinct phases:
    • In the first phase, Part I, we investigate the use of LVM-based topic models for the specific task of event detection in Twitter. In this situation, the use of contextual information to pool tweets together comes naturally. Thus, we first extend an existing clustering algorithm for event detection to use the topics learned from pooled tweets. Then, we propose a probability model that integrates topic modelling and clustering to enable the flow of information between both components.
    • In the second phase, Part II and Part III, we challenge the use of local latent variables in LVMs, especially when the context of short messages is not available. First, we study how to evaluate the generalization capabilities of LVMs such as PFA (Poisson Factor Analysis) through the likelihood. Since computing this likelihood is intractable, we propose unbiased estimation methods to approximate it. With the most accurate method, we compare the generalization of chordal models without latent variables to that of PFA topic models in short and regular text collections.
    In summary, we demonstrate that integrating clustering and topic modelling improves the performance of event detection techniques in Twitter thanks to the interaction between both components. Moreover, we develop several unbiased likelihood estimation methods for assessing the generalization of PFA and empirically validate their accuracy on different document collections. Finally, we show that chordal models without latent variables can be learned from text through Chordalysis, and that they can be a competitive alternative to classical topic models, especially in short text.
    Postprint (published version)

    Event detection in location-based social networks

    With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news; for example, Osama Bin Laden’s death was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term event detection to refer to the task of uncovering emerging patterns in data streams. Nonetheless, most data mining techniques do not reproduce the underlying data generation process, which hampers their ability to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection which explicitly models the data generation process and enables reasoning about the discovered events. To set out the differences between both approaches, we present two techniques for the problem of event detection in Twitter: a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques on a dataset of tweets geo-located in the city of Barcelona during its annual festivities. Finally, we present the algorithmic changes and data processing frameworks needed to scale up the proposed techniques to big data workloads.
    This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract (TIN2015-65316), by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence. We would also like to thank the reviewers for their constructive feedback.
    Peer Reviewed. Postprint (author's final draft)

    Scaling DBSCAN-like algorithms for event detection systems in Twitter

    The increasing use of mobile social networks has lately transformed news media. Real-world events are nowadays reported in social networks much faster than in traditional channels. As a result, the autonomous detection of events from networks like Twitter has gained a lot of interest in both research and media groups. DBSCAN-like algorithms constitute a well-known clustering approach to retrospective event detection. However, scaling such algorithms to geographically large regions and temporally long periods presents two major challenges. First, detecting real-world events from the vast amount of tweets can no longer be performed on a single machine. Second, tweeting activity varies a lot within these broad space-time regions, limiting the use of global parameters. Against this background, we propose to scale DBSCAN-like event detection techniques by parallelizing and distributing them through a novel density-aware MapReduce scheme. The proposed scheme partitions tweet data according to its spatial and temporal features and tailors local DBSCAN parameters to local tweet densities. We implement the scheme in Apache Spark and evaluate its performance on a dataset composed of geo-located tweets in the Iberian peninsula during the course of several football matches. The results show the benefits of our proposal over other state-of-the-art techniques in terms of speed-up and detection accuracy.
    Peer Reviewed. Postprint (author's final draft)

    Social review-based recommender systems from theory to practice

    Award for the best final degree project (PFC) in the Information Systems area of Telecommunications Engineering or Electronic Engineering at ETSETB-UPC (academic year 2013-2014). Awarded by Cátedra Red.es.
    Social Recommender Systems were born with the goal of mitigating the current information overload caused, among other things, by the birth of Social Networks. They have enabled Internet actors (e.g. users, web browsers, sensors, actuators, etc.) to make more informed decisions based on the information that is shown to them, up to the point that some actors even blindly trust the recommendations generated by these systems. Within this scenario, this thesis proposes a novel Hybrid Social Recommender System purely based on the text reviews typed by users. The proposed engine treats the review content and sentiment separately and finally combines both into a single recommendation. Very little scientific research has been published on mining text reviews with the aim of performing item recommendation. Moreover, among all Hybrid Recommendation Systems in the literature, none combines the above-mentioned review features in a collaborative and content-based recommender. To assess the effectiveness of the platform, we present a methodology that covers extracting the data directly from a Social Network, cleaning and pre-processing the text data, building the predictive model with different state-of-the-art machine learning techniques, and evaluating the system in terms of several key metrics. The data extraction process receives particular attention because of the challenges imposed by most social platforms in obtaining all the geo-positioned data generated in a bounded region. To overcome the platform limitations, we introduce the use of the Quadtree algorithm with the goal of crawling all the geo-positioned reviews. The algorithm is enhanced with a module that copes with the time dynamics and captures the time-stamped data as well. Moreover, we study the effectiveness of the Quadtree partition method to crawl any type of spatial data, which tends to be softly distributed in the area. This thesis draws several conclusions from the available data about the use of several state-of-the-art text mining techniques and the effectiveness of the proposed recommender setup. Nonetheless, future work needs to design and propose novel evaluation methodologies that uncouple the system evaluation from the data.
    Award-winning
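The abstract does not spell out the crawling procedure. One common way to use a quadtree for this purpose, assuming the platform caps the number of results returned per bounding-box query, is to split any box that hits the cap into four quadrants and query each one again. In the sketch below, `fetch`, `limit` and `min_size` are hypothetical names rather than part of any real platform API, and the time-aware module mentioned in the abstract is not modelled.

```python
def quadtree_crawl(bbox, fetch, limit, min_size=1e-6):
    """Collect all items inside bbox = (min_x, min_y, max_x, max_y).

    fetch(bbox) is a hypothetical callable that returns a list of
    (item_id, (x, y)) pairs, truncated to at most `limit` entries.
    """
    min_x, min_y, max_x, max_y = bbox
    items = fetch(bbox)

    # Fewer results than the cap means nothing was truncated, so stop here.
    # Also stop when the box is too small to split further.
    too_small = (max_x - min_x) <= min_size or (max_y - min_y) <= min_size
    if len(items) < limit or too_small:
        return dict(items)

    # The cap was hit: split into four quadrants and crawl each one.
    mid_x = (min_x + max_x) / 2
    mid_y = (min_y + max_y) / 2
    quadrants = [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ]
    found = {}
    for quadrant in quadrants:
        # Keying by item_id removes duplicates that lie on shared borders.
        found.update(quadtree_crawl(quadrant, fetch, limit, min_size))
    return found
```

Dense areas end up covered by many small boxes and sparse areas by a few large ones, so the number of queries adapts to how the data is spread over the region.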

    Mining urban events from the tweet stream through a probabilistic mixture model

    The geographical identification of content in Social Networks has made it possible to bridge the gap between online social platforms and the physical world. Although vast amounts of data in such networks are due to breaking news or global occurrences, local events witnessed by users in situ are also present in these streams and are of great importance for many city entities. Nowadays, unsupervised machine learning techniques, such as Tweet-SCAN, are able to retrospectively detect these local events from tweets. However, these approaches have limited abilities to reason about unseen observations in a principled way due to the lack of a proper probabilistic foundation. Probabilistic models have also been proposed for the task, but their event identification capabilities are far from those of Tweet-SCAN. In this paper, we identify two key factors which, when combined, boost the accuracy of such models. As a first key factor, we notice that the large amount of meaningless social data requires explicitly modeling non-event observations. Therefore, we propose to incorporate a background model that captures spatio-temporal fluctuations of non-event tweets. As a second key factor, we observe that the shortness of tweets hampers the application of traditional topic models. Thus, we integrate event detection and topic modeling, assigning topic proportions to events instead of assigning them to individual tweets. As a result, we propose Warble, a new probabilistic model and learning scheme for retrospective event detection that incorporates these two key factors. We evaluate Warble on a data set of tweets located in Barcelona during its festivities. The empirical results show that the model outperforms other state-of-the-art techniques in detecting various types of events while relying on a principled probabilistic framework that enables reasoning under uncertainty.
    This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract (TIN2015-65316), by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence. We would also like to thank the reviewers for their constructive feedback.
    Peer Reviewed. Postprint (author's final draft)

    Tweet-SCAN: an event discovery technique for geo-located tweets

    Twitter has become one of the most popular Location-based Social Networks (LBSNs) that bridge the physical and virtual worlds. Tweets, 140-character-long messages, are meant to answer the question “What’s happening?”. Real-life occurrences and events (such as political protests, music concerts, natural disasters or terrorist acts) are usually reported through geo-located tweets by users on site. Separating event-related tweets from the rest is a challenging problem that requires exploiting different tweet features. With that in mind, we propose Tweet-SCAN, a novel event discovery technique based on the popular density-based clustering algorithm DBSCAN. Tweet-SCAN takes into account four main features of a tweet, namely content, time, location and user, to group together event-related tweets. The proposed technique models textual content through a probabilistic topic model called the Hierarchical Dirichlet Process and introduces the Jensen–Shannon distance for the task of neighborhood identification in the textual dimension. We evaluate Tweet-SCAN on two real data sets of geo-located tweets posted during Barcelona local festivities in 2014 and 2015, for which some of the events were identified by domain experts beforehand. Through these tagged data sets, we are able to assess Tweet-SCAN's capabilities to discover events, justify using a textual component and highlight the effects of several parameters.
    Peer Reviewed. Postprint (author's final draft)
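The abstract names the Jensen–Shannon distance without giving its formula. As a minimal sketch, the standard definition (the square root of the Jensen–Shannon divergence, here with base-2 logarithms so that values fall in [0, 1]) can be computed between two topic-proportion vectors as follows. The function name and the example vectors are illustrative assumptions, not taken from the paper.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    p and q are sequences of non-negative probabilities of equal length,
    each summing to 1 (for example, topic proportions of two tweets).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")

    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence in bits; terms with x_i = 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))
```

In a DBSCAN-style neighborhood test, two tweets would count as textual neighbors when this distance falls below a chosen threshold; the threshold itself is a parameter that the abstract does not specify.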

    Un ABP basado en la robótica para las ingenierías informáticas

    In recent years we have seen a decline in enrolment in technology degrees, owing to students' lack of motivation for technical studies that demand great effort to complete. Moreover, students entering the first years often find a set of subjects different from what they expected and perceive them as disconnected from one another. Both circumstances contribute significantly to the decline in the number of students progressing through technical degrees. This article presents a PBL (Problem-Based Learning; ABP, Aprendizaje Basado en Problemas, in Spanish) project that uses robotics. The experience is currently being carried out at l’Escola Tècnica Superior d’Enginyeria and l’Escola Tècnica en Informàtica de Sabadell. The main goal is to increase the motivation of students arriving in the first year of the degree without reducing the technical content stipulated in the current curricula. The main novelty with respect to previous experiences is that students will work on software and hardware aspects in a coordinated way, using a robot as a common architecture. The article is organized as follows: the introduction explains the current situation and the objectives of the work; next, the earlier work on which our experience is based is presented; section 3 develops the planning of the PBL, discusses the competences that PBL can add to these first subjects of the curriculum, and then presents the implementation of the PBL.
    Peer Reviewed

    The value of repeat biopsy in lupus nephritis flares

    Whether a repeat renal biopsy is helpful during lupus nephritis (LN) flares remains debatable. In order to analyze the clinical utility of repeat renal biopsy in this complex situation, we retrospectively reviewed our series of 54 LN patients who had one or more repeat biopsies performed only on clinical indications. Additionally, we reviewed 686 well-documented similar cases previously reported (PubMed 1990-2015). The analysis of all patients reviewed showed that histological transformations are common during an LN flare, ranging from 40% to 76% of cases. However, the prevalence of transformations and the clinical value of repeat biopsy vary when they are analyzed according to proliferative or nonproliferative lesions. The great majority of patients with class II (78% in our series and 77.5% in the literature review) progressed to a higher grade of nephritis (classes III, IV, or V), resulting in worse renal prognosis. The frequency of pathological conversion in class V is lower (33% and 43%, respectively) but equally clinically relevant, since almost all cases switched to a proliferative class. Therefore, repeat biopsy is highly advisable in patients with nonproliferative LN at baseline biopsy, because these patients have a reasonable likelihood of switching to a proliferative LN that may require more aggressive immunosuppression. In contrast, the majority of patients (82% and 73%) with proliferative classes in the reference biopsy (III, IV or mixed III/IV + V) remained in proliferative classes on repeat biopsy. Although rebiopsy in this group does not seem as necessary, it is still advisable since it will allow us to identify the 18% to 20% of patients that switch to a nonproliferative class. In addition, consistent with the reported clinical experience, repeat biopsy might also be helpful to identify selected cases with clear progression of proliferative lesions despite the initial treatment, for whom it is advisable to intensify immunosuppression. Thus, our experience and the literature data support that repeat biopsy also brings more advantages than threats in this group. The results of the repeat biopsy led to a change in the immunosuppressive treatment in more than half of the patients on average, intensifying it in the majority of the cases, but also reducing it in 5% to 30%

    Estrogen and COVID-19 symptoms: Associations in women from the COVID Symptom Study

    It has been widely observed that adult men of all ages are at higher risk of developing serious complications from COVID-19 when compared with women. This study aimed to investigate the association of COVID-19 positivity and severity with estrogen exposure in women, in a population-based matched cohort study of female users of the COVID Symptom Study application in the UK. Analyses included 152,637 women for menopausal status, 295,689 women for exogenous estrogen intake in the form of the combined oral contraceptive pill (COCP), and 151,193 menopausal women for hormone replacement therapy (HRT). Data were collected using the COVID Symptom Study in May-June 2020. Analyses investigated associations between predicted or tested COVID-19 status and menopausal status, COCP use, and HRT use, adjusting for age, smoking and BMI, with follow-up age sensitivity analysis, and validation in a subset of participants from the TwinsUK cohort. Menopausal women had higher rates of predicted COVID-19 (P = 0.003). COCP users had lower rates of predicted COVID-19 (P = 8.03E-05), with a reduction in hospital attendance (P = 0.023). Menopausal women using HRT or hormonal therapies did not exhibit consistent associations, including increased rates of predicted COVID-19 (P = 2.22E-05) for HRT users alone. The findings support a protective effect of estrogen exposure on COVID-19, based on the positive association of predicted COVID-19 with menopausal status and the negative association with COCP use. HRT use was positively associated with COVID-19, but the results should be considered with caution due to the lack of data on HRT type, route of administration and duration of treatment, and to potential unaccounted-for confounders and comorbidities

    Illness duration and symptom profile in symptomatic UK school-aged children tested for SARS-CoV-2.

    BACKGROUND: In children, SARS-CoV-2 infection is usually asymptomatic or causes a mild illness of short duration. Persistent illness has been reported; however, its prevalence and characteristics are unclear. We aimed to determine illness duration and characteristics in symptomatic UK school-aged children tested for SARS-CoV-2 using data from the COVID Symptom Study, one of the largest UK citizen participatory epidemiological studies to date. METHODS: In this prospective cohort study, data from UK school-aged children (age 5-17 years) were reported by an adult proxy. Participants were voluntary, and used a mobile application (app) launched jointly by Zoe Limited and King's College London. Illness duration and symptom prevalence, duration, and burden were analysed for children testing positive for SARS-CoV-2 for whom illness duration could be determined, and were assessed overall and for younger (age 5-11 years) and older (age 12-17 years) groups. Children with longer than 1 week between symptomatic reports on the app were excluded from analysis. Data from symptomatic children testing negative for SARS-CoV-2, matched 1:1 for age, gender, and week of testing, were also assessed. FINDINGS: 258 790 children aged 5-17 years were reported by an adult proxy between March 24, 2020, and Feb 22, 2021, of whom 75 529 had valid test results for SARS-CoV-2. 1734 children (588 younger and 1146 older children) had a positive SARS-CoV-2 test result and calculable illness duration within the study timeframe (illness onset between Sept 1, 2020, and Jan 24, 2021). The most common symptoms were headache (1079 [62·2%] of 1734 children), and fatigue (954 [55·0%] of 1734 children). Median illness duration was 6 days (IQR 3-11) versus 3 days (2-7) in children testing negative, and was positively associated with age (Spearman's rank-order rs 0·19, p<0·0001). Median illness duration was longer for older children (7 days, IQR 3-12) than younger children (5 days, 2-9). 
77 (4·4%) of 1734 children had illness duration of at least 28 days, more commonly in older than younger children (59 [5·1%] of 1146 older children vs 18 [3·1%] of 588 younger children; p=0·046). The commonest symptoms experienced by these children during the first 4 weeks of illness were fatigue (65 [84·4%] of 77), headache (60 [77·9%] of 77), and anosmia (60 [77·9%] of 77); however, after day 28 the symptom burden was low (median 2 symptoms, IQR 1-4) compared with the first week of illness (median 6 symptoms, 4-8). Only 25 (1·8%) of 1379 children experienced symptoms for at least 56 days. Few children (15 children, 0·9%) in the negatively tested cohort had symptoms for at least 28 days; however, these children experienced greater symptom burden throughout their illness (9 symptoms, IQR 7·7-11·0 vs 8, 6-9) and after day 28 (5 symptoms, IQR 1·5-6·5 vs 2, 1-4) than did children who tested positive for SARS-CoV-2. INTERPRETATION: Although COVID-19 in children is usually of short duration with low symptom burden, some children with COVID-19 experience prolonged illness duration. Reassuringly, symptom burden in these children did not increase with time, and most recovered by day 56. Some children who tested negative for SARS-CoV-2 also had persistent and burdensome illness. A holistic approach for all children with persistent illness during the pandemic is appropriate. FUNDING: Zoe Limited, UK Government Department of Health and Social Care, Wellcome Trust, UK Engineering and Physical Sciences Research Council, UK Research and Innovation London Medical Imaging and Artificial Intelligence Centre for Value Based Healthcare, UK National Institute for Health Research, UK Medical Research Council, British Heart Foundation, and Alzheimer's Society